University of Amsterdam at THUMOS Challenge 2014
Authors: Mihir Jain, Jan van Gemert, Cees G. M. Snoek
Abstract
This notebook paper describes our approach for the action classification task of the THUMOS Challenge 2014. We investigate and exploit the action-object relationship by capturing both motion and related objects. As local descriptors we use HOG, HOF and MBH computed along the improved dense trajectories. For video encoding we rely on the Fisher vector. In addition, we employ deep net features learned from object attributes to capture action context. All actions are classified with a one-versus-rest linear SVM.

Keywords: Action recognition, motion trajectories, deep net features, object attributes

1 Classification framework

Our action classification framework consists of two main components: video representation and classification. The video representation is summarized in Fig. 1. Many of the action classes in the given dataset have related objects, such as ‘Billiards’, ‘PlayingTabla’ and ‘RockClimbingIndoor’. Therefore, along with motion we also capture the appearance information of object attributes. In the following subsections, we describe these two types of representation and their classification.

1.1 Motion based representation

We capture motion information with several local descriptors (HOG, HOF and MBH) computed along the improved trajectories [4]. Improved trajectories is a recently proposed approach that compensates for camera motion, which has been shown to be critical in action recognition [1, 4]. To encode the local descriptors, we use the Fisher vector. We first apply PCA to these local descriptors and reduce their dimensionality by a factor of two. Then 256,000 descriptors are selected at random from the ‘UCF101’ set and the ‘Background’ set to estimate a GMM with K (=256) Gaussians. Each video is then represented by a 2DK-dimensional Fisher vector, where D is the dimension of the descriptors after PCA. Finally, we apply power and L2 normalization to the Fisher vector, as done in [2].

Fig. 1. Motion and appearance representations for action recognition in videos: improved trajectories with HOG, HOF and MBH descriptors encoded as Fisher vectors (motion), and deep net features (Layer-6, Layer-7, Layer-8) computed per frame and averaged over frames (appearance).

1.2 Appearance based representation

For the appearance representation, we employ deep net features. We use the output of the last three fully connected layers of an 8-layer convolutional neural network [3]. The input is raw pixel data; the output is 15K object scores. For the final video representation, we average each of these output vectors across the frames. We refer to these three representations as Layer-6, Layer-7 and Layer-8.

1.3 Classification and merging representations

In all experiments, we use an SVM with a linear kernel for classification. We set C=100 for the SVM and learn 101 one-versus-rest classifiers. To combine two or more of the six representations described above, we simply sum their kernels.
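To make the encoding pipeline of Section 1.1 concrete, the following is a minimal sketch of the PCA + GMM + Fisher vector steps with power and L2 normalization, following the standard formulation of [2]. It assumes descriptors are stacked as rows of a NumPy array; all function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_codebook(sampled_descriptors, k=256):
    """Fit PCA (halving dimensionality) and a diagonal-covariance GMM
    on descriptors sampled from the training videos."""
    d = sampled_descriptors.shape[1] // 2
    pca = PCA(n_components=d).fit(sampled_descriptors)
    gmm = GaussianMixture(n_components=k, covariance_type="diag")
    gmm.fit(pca.transform(sampled_descriptors))
    return pca, gmm

def fisher_vector(descriptors, pca, gmm):
    """Encode one video's descriptors as a 2*D*K Fisher vector with
    power and L2 normalization."""
    x = pca.transform(descriptors)            # (T, D)
    t = x.shape[0]
    q = gmm.predict_proba(x)                  # (T, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (x[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    # Gradients w.r.t. the means and standard deviations of the GMM
    g_mu = (q[:, :, None] * diff).sum(axis=0) / (t * np.sqrt(w)[:, None])
    g_sigma = (q[:, :, None] * (diff**2 - 1)).sum(axis=0) / (t * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization
```

With K = 256 components, a descriptor of dimension D after PCA yields the 2DK-dimensional vector described above.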
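The appearance representation of Section 1.2 amounts to averaging per-frame network outputs. A small sketch, assuming the per-frame activations of the last three fully connected layers have already been extracted:

```python
import numpy as np

def video_appearance_feature(frame_activations):
    """Average per-frame deep net outputs (Layer-6, -7 or -8 activations)
    over all frames of a video into a single appearance vector."""
    return np.mean(np.asarray(frame_activations), axis=0)
```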
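For the classification of Section 1.3, summing linear kernels over representations can be implemented with a precomputed-kernel SVM. A sketch using scikit-learn; the view lists, label array and helper name are hypothetical placeholders:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def summed_linear_kernel(views_a, views_b):
    """Sum of linear kernels over several representations (e.g. the HOG,
    HOF and MBH Fisher vectors plus the deep net layers). Each view is an
    (n_samples, dim) matrix; dims may differ between views."""
    return sum(a @ b.T for a, b in zip(views_a, views_b))

# Hypothetical usage, with y_train holding the 101 action labels:
# k_train = summed_linear_kernel(train_views, train_views)
# k_test  = summed_linear_kernel(test_views, train_views)
# clf = OneVsRestClassifier(SVC(kernel="precomputed", C=100))
# clf.fit(k_train, y_train)
# scores = clf.decision_function(k_test)
```

Because each Fisher vector and deep net feature is L2-normalized, the summed kernels contribute on comparable scales.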
Similar articles
The THUMOS challenge on action recognition for videos "in the wild"
Automatically recognizing and localizing wide ranges of human actions are crucial for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including the THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artificial task....
MindLAB at the THUMOS Challenge
In this notebook paper we describe the MindLAB research group participation at the THUMOS challenge held as part of the ICCV 2013 conference. Two runs were submitted using different methods (SVM, OMF) with the features provided by the challenge organizers (DTF). The performance obtained shows an improvement over the baseline method.
Extreme Learning Machine for Large-Scale Action Recognition
In this paper, we describe the method we applied for the action recognition task on the THUMOS 2014 challenge dataset. We study human action recognition in RGB videos through low-level features by focusing on improved trajectory features that are densely extracted from the spatio-temporal volume. We represent each video with Fisher vector encoding and additional mid-level features. Finally, we...
Efficient Action and Event Recognition in Videos Using Extreme Learning Machines
A great deal of research in the computer vision community has gone into action and event recognition studies. Automatic video understanding for actions is crucial for application areas such as video indexing, surveillance and video summarization. In this thesis, we explore action and event recognition on RGB videos bo...
UTS-CMU at THUMOS 2015
This notebook paper describes our solution from the UTS-CMU team in the THUMOS 2015 action recognition challenge. Our system contains two major components: video representation generated by VLAD encoding from ConvNet features, and multi-skip improved Dense Trajectories. In addition, we explore optical flow ConvNet and acoustic features such as MFCC and ASR in our system. We demonstrate that our compl...
Publication date: 2014